Abstract: Privacy preservation is important for machine learning and data mining, but measures designed to protect 
private information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy preserving 
approach that can be applied to decision tree learning, without concomitant loss of accuracy. It describes an approach to 
the preservation of the privacy of collected data samples in cases where information from the sample database has been 
partially lost. This approach converts the original sample data sets into a group of unreal data sets, from which the original 
samples cannot be reconstructed without the entire group of unreal data sets. Meanwhile, an accurate decision tree can be 
built directly from those unreal data sets. This novel approach can be applied directly to the data storage as soon as the first 
sample is collected. The approach is compatible with other privacy preserving approaches, such as cryptography, for extra 
protection. 
I. Introduction 

Data mining is widely used by researchers for science and business purposes. Data collected (referred to as "sample 
data sets" or "samples") from individuals (referred to as "information providers") are important for decision making or pattern 
recognition. Therefore, privacy-preserving processes have been developed to sanitize private information from the samples 
while keeping their utility. 

A large body of research has been devoted to the protection of sensitive information when samples are given to 
third parties for processing or computing [1], [2], [3], [4], [5]. It is in the interest of research to disseminate samples to a 
wide audience of researchers, without making strong assumptions about their trustworthiness. 

Even if information collectors ensure that data are released only to third parties with non-malicious intent (or if a 
privacy preserving approach can be applied before the data are released), there is always the possibility that the information 
collectors may inadvertently disclose samples to malicious parties or that the samples are actively stolen from the collectors. 
Samples may be leaked or stolen anytime during the storing process [6], [7] or while residing in storage [8], [9]. This paper focuses 
on preventing such attacks on third parties for the whole lifetime of the samples. 

Contemporary research in privacy preserving data mining mainly falls into one of two categories: 1) perturbation 
and randomization-based approaches, and 2) secure multiparty computation (SMC)-based approaches [10]. SMC approaches 
employ cryptographic tools for collaborative data mining computation by multiple parties. Samples are distributed among 
different parties and they take part in the information computation and communication process. SMC research focuses on 
protocol development [11] for protecting privacy among the involved parties [12] or computation efficiency [13]; however, 
centralized processing of samples and storage privacy is out of the scope of SMC. 

We introduce a new perturbation and randomization-based approach that protects centralized sample data sets 
utilized for decision tree data mining. Privacy preservation is applied to sanitize the samples prior to their release to third 
parties in order to mitigate the threat of their inadvertent disclosure or theft. In contrast to other sanitization methods, our 
approach does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data 
sets, such that the originals do not need to be reconstructed. Moreover, this approach can be applied at any time during the 
data collection process so that privacy protection can be in effect even while samples are still being collected. 

The following assumptions are made for the scope of this paper: first, as is the norm in data collection processes, a 
sufficiently large number of sample data sets have been collected to achieve significant data mining results covering the 
whole research target. Second, the number of data sets leaked to potential attackers constitutes a small portion of the entire 
sample database. Third, identity attributes (e.g., social insurance number) are not considered for the data mining process 
because such attributes are not meaningful for decision making. Fourth, all data collected are discretized; continuous values 
can be represented via ranged-value attributes for decision tree data mining. 

II. Related Work 

In Privacy Preserving Data Mining: Models and Algorithms [14], Aggarwal and Yu classify privacy preserving data 
mining techniques, including data modification and cryptographic, statistical, query auditing and perturbation-based 
strategies. Statistical, query auditing and most cryptographic techniques are subjects beyond the focus of this paper. In this 
section, we explore the privacy preservation techniques for storage privacy attacks. 

International Journal of Modern Engineering Research (IJMER) 
www.ijmer.com Vol.3, Issue.2, March-April 2013 pp-809-815 ISSN: 2249-6645 

Data modification techniques maintain privacy by modifying attribute values of the sample data sets. Essentially, 
data sets are modified by eliminating or unifying uncommon elements among all data sets. These similar data sets act as 
masks for the others within the group because they cannot be distinguished from the others; every data set is loosely linked 
with a certain number of information providers. K-anonymity [15] is a data modification approach that aims to protect 
private information of the samples by generalizing attributes. K-anonymity trades privacy for utility. Further, this approach 
can be applied only after the entire data collection process has been completed. 

Perturbation-based approaches attempt to achieve privacy protection by distorting information from the original 
data sets. The perturbed data sets still retain features of the originals so that they can be used to perform data mining directly 
or indirectly via data reconstruction. Random substitutions [16] is a perturbation approach that randomly substitutes the 
values of selected attributes to achieve privacy protection for those attributes, and then applies data reconstruction when 
these data sets are needed for data mining. Even though privacy of the selected attributes can be protected, the utility is not 
recoverable because the reconstructed data sets are random estimations of the originals. 

Most cryptographic techniques are derived for secure multiparty computation, but only some of them are applicable 
to our scenario. To preserve private information, samples are encrypted by a function, f, (or a set of functions) with a key, k, 
(or a set of keys); meanwhile, original information can be reconstructed by applying a decryption function, f^{-1}, (or a set of 
functions) with the key, k, which raises the security issues of the decryption function(s) and the key(s). Building meaningful 
decision trees needs encrypted data to either be decrypted or interpreted in its encrypted form. The (anti)monotone 
framework [17] is designed to preserve both the privacy and the utility of the sample data sets used for decision tree data 
mining. This method applies a series of encrypting functions to sanitize the samples and decrypts them correspondingly for 
building the decision tree. However, this raises the security concerns about the encrypting and decrypting functions. In 
addition to protecting the input data of the data mining process, this approach also protects the output data, i.e., the generated 
decision tree. Still, this output data can normally be considered sanitized because it constitutes an aggregated result and does 
not belong to any individual information provider. In addition, this approach does not work well for discrete-valued 
attributes. 

III. Dataset Complementation Approach 

In the data set complementation approach, the Unrealize-Training-Set algorithm is used. Traditionally, a training set, T_s, is 
constructed by inserting sample data sets into a data table. The data set complementation approach, however, requires an extra 
data table, T_p. T_p is a perturbing set that generates unreal data sets, which are used for converting the sample data into an 
unrealized training set, T'. The algorithm for unrealizing the training set, T_s, is as follows: 

Algorithm Unrealize-Training-Set(T_s, T_u, T', T_p) 
Input: T_s, a set of input sample data sets 
       T_u, a universal set 
       T', a set of output training data sets 
       T_p, a perturbing set 
Output: <T', T_p> 

1. if T_s is empty then return <T', T_p> 
2. t ← a data set in T_s 
3. if t is not an element of T_p or T_p = {t} then 
4.    T_p ← T_p + T_u 
5. T_p ← T_p - {t} 
6. t' ← the most frequent data set in T_p 
7. return Unrealize-Training-Set(T_s - {t}, T_u, T' + {t'}, T_p - {t'}) 

To unrealize the samples, T_s, we initialize both T' and T_p as empty sets, i.e., we invoke the above algorithm with 
Unrealize-Training-Set(T_s, T_u, { }, { }). The resulting unrealized training set contains some dummy data sets excepting the 
ones in T_s. The elements in the resulting data sets are unreal individually, but meaningful when they are used together to 
calculate the information required by a modified ID3 algorithm. 
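
The recursion above can be sketched iteratively in Python. This is a minimal illustration rather than the authors' implementation: it assumes that data sets are hashable tuples, that T' and T_p are multisets (modelled with `collections.Counter`), that T_u contains at least two data sets, and that ties for "the most frequent data set" may be broken arbitrarily. The function name and the toy attribute domains are hypothetical.

```python
from collections import Counter
from itertools import product


def unrealize_training_set(samples, universe):
    """Return (T', T_p) as multisets for the input samples T_s.

    samples  -- list of hashable data sets (tuples of attribute values)
    universe -- list holding every possible data set exactly once (T_u)
    """
    t_prime = Counter()  # T', the unrealized training set
    t_p = Counter()      # T_p, the perturbing set
    for t in samples:
        only_t_left = sum(t_p.values()) == 1 and t_p[t] == 1  # T_p = {t}
        if t_p[t] == 0 or only_t_left:
            t_p.update(universe)        # T_p <- T_p + T_u
        t_p[t] -= 1                     # T_p <- T_p - {t}
        t_dash = max(t_p, key=t_p.get)  # a most frequent data set in T_p
        t_prime[t_dash] += 1            # T' <- T' + {t'}
        t_p[t_dash] -= 1                # T_p <- T_p - {t'}
    return t_prime, +t_p                # unary plus drops zero counts


# Toy domain (an assumption for illustration): three binary attributes.
domains = [("sunny", "rain"), ("hot", "cool"), ("yes", "no")]
universe = list(product(*domains))
samples = [("sunny", "hot", "no")] * 3 + [("rain", "cool", "yes")] * 2
t_prime, t_p = unrealize_training_set(samples, universe)
```

Each loop iteration removes one sample from T_p and moves one unreal data set from T_p to T', so T' always ends up with as many data sets as T_s.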

IV. Decision Tree Generation 

The well-known ID3 algorithm [18] shown below builds a decision tree by calling algorithm Choose-Attribute 
recursively. This algorithm selects a test attribute (with the smallest entropy) according to the information content of the 
training set T_s. The information entropy functions are given as 

\[ H_{a_i}(T_s) = -\sum_{e \in K_i} \frac{|T_{s(a_i = e)}|}{|T_s|} \log_2 \frac{|T_{s(a_i = e)}|}{|T_s|} \]

and 

\[ H_{a_i}(T_s \mid a_j) = \sum_{f \in K_j} \frac{|T_{s(a_j = f)}|}{|T_s|} H_{a_i}(T_{s(a_j = f)}) \]

where K_i and K_j are the sets of possible values for the decision attribute, a_i, and test attribute, a_j, in T_s, respectively, 
T_{s(a = e)} denotes the data sets in T_s whose attribute a has the value e, 
and the algorithm Majority-Value retrieves the most frequent value of the decision attribute of T_s. 
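
As an illustration of the two entropy functions, the following sketch computes them for a small table. It assumes each data set is a Python dict keyed by attribute name; the function names and the toy rows are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(rows, decision):
    """Entropy of the decision attribute over the given rows."""
    total = len(rows)
    counts = Counter(row[decision] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def conditional_entropy(rows, decision, test):
    """Entropy of the decision attribute after splitting the rows on `test`."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[test], []).append(row)
    return sum(len(g) / total * entropy(g, decision) for g in groups.values())


rows = [
    {"wind": "weak", "play": "yes"},
    {"wind": "weak", "play": "yes"},
    {"wind": "strong", "play": "no"},
    {"wind": "strong", "play": "no"},
]
h = entropy(rows, "play")                            # two classes, evenly split
h_given_wind = conditional_entropy(rows, "play", "wind")  # wind separates the classes
```

In this toy table the decision attribute is evenly split, so its entropy is one bit, while splitting on `wind` leaves no remaining uncertainty.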

Algorithm Generate-Tree(T_s, attribs, default) 
Input: T_s, the set of training data sets 
       attribs, set of attributes 
       default, default value for the goal predicate 
Output: tree, a decision tree 

1. if T_s is empty then return default 
2. default ← Majority-Value(T_s) 
3. if H_{a_i}(T_s) = 0 then return default 
4. else if attribs is empty then return default 
5. else 
6.    best ← Choose-Attribute(attribs, T_s) 
7.    tree ← a new decision tree with root attribute best 
8.    for each value v_i of best do 
9.       T_{s_i} ← {data sets in T_s as best = v_i} 
10.      subtree ← Generate-Tree(T_{s_i}, attribs - best, default) 
11.      connect tree and subtree with a branch labelled v_i 
12. return tree 
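
The listing above might be rendered in Python roughly as follows. This is a sketch under stated assumptions: data sets are dicts, the tree is a nested dict, and the extra `decision` and `domains` parameters (the name of the decision attribute and the possible values of each test attribute) are hypothetical additions needed to make the example self-contained.

```python
from collections import Counter
from math import log2


def entropy(rows, decision):
    counts = Counter(row[decision] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def choose_attribute(rows, attribs, decision):
    """Pick the test attribute that leaves the smallest conditional entropy."""
    def remaining(attr):
        groups = Counter(row[attr] for row in rows)
        return sum(
            n / len(rows) * entropy([r for r in rows if r[attr] == v], decision)
            for v, n in groups.items()
        )
    return min(attribs, key=remaining)


def generate_tree(rows, attribs, default, decision, domains):
    if not rows:
        return default
    default = Counter(r[decision] for r in rows).most_common(1)[0][0]  # Majority-Value
    if entropy(rows, decision) == 0 or not attribs:
        return default
    best = choose_attribute(rows, attribs, decision)
    tree = {"attribute": best, "branches": {}}
    rest = [a for a in attribs if a != best]
    for value in domains[best]:
        subset = [r for r in rows if r[best] == value]
        tree["branches"][value] = generate_tree(subset, rest, default, decision, domains)
    return tree


rows = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "yes"},
]
domains = {"outlook": ["sunny", "rain"], "wind": ["weak", "strong"]}
tree = generate_tree(rows, ["outlook", "wind"], "yes", "play", domains)
```

Because `outlook` alone determines `play` in this toy table, the resulting tree has a single split on `outlook` with one leaf per value.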

We have already discussed an algorithm that generates an unrealized training set, T', and a perturbing set, T_p, from the 
samples in T_s. In this section, we use data tables T' and T_p as a means to calculate the information content and information 
gain of T_s, such that a decision tree of the original data sets can be generated based on T' and T_p. 

4.1 Information Entropy Determination 

From the algorithm Unrealize-Training-Set, it is obvious that the size of T_s is the same as the size of T'. 
Furthermore, all data sets in (T' + T_p) are based on the data sets in T_u, excepting the ones in T_s, i.e., T_s is the q-absolute 
complement of (T' + T_p) for some positive integer q. Since qT_u = T_s + T' + T_p and |T_s| = |T'|, the size of qT_u can be 
computed from the sizes of T' and T_p, with |qT_u| = 2|T'| + |T_p|. Therefore, entropies of the original data sets, T_s, with any 
decision attribute and any test attribute, can be determined by the unreal training set, T', and perturbing set, T_p. 
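
To make the counting argument concrete, the sketch below recovers the entropy of T_s from the combined multiset T' + T_p alone. It assumes that T_u is the full Cartesian product of the attribute domains, so that every value of an attribute covers an equal share of qT_u; the function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy_of_originals(combined, size, decision, domain):
    """Entropy of the decision attribute in T_s, using only T' + T_p.

    combined -- Counter over data sets (tuples), the multiset T' + T_p
    size     -- |qT_u| = 2|T'| + |T_p|
    decision -- index of the decision attribute in each tuple
    domain   -- possible values of the decision attribute
    """
    per_value = size // len(domain)        # data sets in qT_u with a given value
    total = size - sum(combined.values())  # |T_s| = |qT_u| - |T' + T_p|
    result = 0.0
    for e in domain:
        seen = sum(n for row, n in combined.items() if row[decision] == e)
        count = per_value - seen           # data sets in T_s with value e
        if count:
            result -= count / total * log2(count / total)
    return result


# Toy case: T_u has four data sets, T_s = 3 x (sunny, no) + 1 x (rain, yes), q = 3.
# T' + T_p = 3 * T_u - T_s, written out by hand; |T'| = 4 and |T_p| = 4.
combined = Counter({("sunny", "yes"): 3, ("rain", "yes"): 2, ("rain", "no"): 3})
size = 2 * 4 + 4
h = entropy_of_originals(combined, size, decision=1, domain=("yes", "no"))
```

The value `h` matches the entropy of a sample with three "no" and one "yes" decisions, even though the original samples never appear in the computation.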

4.2 Modified Decision Tree Generation Algorithm 

As entropies of the original data sets, T_s, can be determined by the retrievable information (the contents of the 
unrealized training set, T', and the perturbing set, T_p), the decision tree of T_s can be generated by the following algorithm. 

Algorithm Generate-Tree'(size, T', T_p, attribs, default) 
Input: size, size of qT_u 
       T', the set of unreal training data sets 
       T_p, the set of perturbing data sets 
       attribs, set of attributes 
       default, default value for the goal predicate 
Output: tree, a decision tree 

1. if (T', T_p) is empty then return default 
2. default ← Minority-Value(T' + T_p) 
3. if H_{a_i}(q[T' + T_p]^c) = 0 then return default 
4. else if attribs is empty then return default 
5. else 
6.    best ← Choose-Attribute'(attribs, size, (T', T_p)) 
7.    tree ← a new decision tree with root attribute best 
8.    size ← size / number of possible values k_i in best 
9.    for each value v_i of best do 
10.      T'_i ← {data sets in T' as best = k_i} 
11.      T_p_i ← {data sets in T_p as best = k_i} 
12.      subtree ← Generate-Tree'(size, T'_i, T_p_i, attribs - best, default) 
13.      connect tree and subtree with a branch labelled k_i 
14. return tree 

Similar to the traditional ID3 approach, algorithm Choose-Attribute' selects the test attribute using the ID3 criteria, 
based on the information entropies, i.e., selecting the attribute with the greatest information gain. Algorithm Minority-Value 
retrieves the least frequent value of the decision attribute of (T' + T_p), which performs the same function as algorithm 
Majority-Value of the traditional ID3 approach, that is, retrieving the most frequent value of the decision attribute of T_s. 
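
The correspondence between Minority-Value and Majority-Value can be checked on a small case. The sketch below assumes that T_u is the full Cartesian product of the attribute domains and that no two decision values are tied (a tie could be resolved differently by the two functions); the function names and data are hypothetical.

```python
from collections import Counter


def majority_value(rows, decision):
    """Most frequent decision value in the original samples T_s."""
    return Counter(row[decision] for row in rows).most_common(1)[0][0]


def minority_value(combined, decision, domain):
    """Least frequent decision value in the multiset T' + T_p."""
    counts = {e: 0 for e in domain}  # values absent from T' + T_p count as zero
    for row, n in combined.items():
        counts[row[decision]] += n
    return min(domain, key=counts.get)


domain = ("yes", "no")
samples = [("sunny", "no")] * 3 + [("rain", "yes")]
# T' + T_p for q = 3 over the four possible data sets, written out by hand.
combined = Counter({("sunny", "yes"): 3, ("rain", "yes"): 2, ("rain", "no"): 3})

from_originals = majority_value(samples, decision=1)
from_unrealized = minority_value(combined, decision=1, domain=domain)
```

Both calls select "no": the value that dominates T_s is exactly the one that is most depleted in T' + T_p.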

To generate the decision tree with T', T_p and |qT_u| (which equals 2|T'| + |T_p|), a possible value, k_d, of the decision 
attribute, a_d, (which is an element of A, the set of attributes in T') should be arbitrarily chosen, i.e., we call the algorithm 
Generate-Tree'(2|T'| + |T_p|, T', T_p, A - a_d, k_d). The resulting decision tree of our new ID3 algorithm with unrealized sample 
inputs is the same as the tree generated by the traditional ID3 algorithm with the original samples. 

4.3 Data Set Reconstruction 

Section 4.2 introduced a modified decision tree learning algorithm that uses the unrealized training set, T', and the 
perturbing set, T_p. Alternatively, we could have reconstructed the original sample data sets, T_s, from T' and T_p, followed by 
an application of the conventional ID3 algorithm for generating the decision tree from T_s. The reconstruction process is 
dependent upon the full information of T' and T_p (where q = (2|T'| + |T_p|)/|T_u|); reconstruction of parts of T_s based on parts 
of T' and T_p is not possible. 
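
A minimal sketch of the reconstruction, assuming T' and T_p are available in full as multisets and T_u is known; the function name and the three-element toy universe are hypothetical.

```python
from collections import Counter


def reconstruct_samples(t_prime, t_p, universe):
    """Recover T_s as the q-absolute complement of T' + T_p."""
    size = 2 * sum(t_prime.values()) + sum(t_p.values())  # |qT_u|
    q, remainder = divmod(size, len(universe))
    if remainder:
        raise ValueError("T' and T_p are incomplete or inconsistent with T_u")
    originals = Counter({u: q for u in universe})         # qT_u
    originals.subtract(t_prime)
    originals.subtract(t_p)
    if any(n < 0 for n in originals.values()):
        raise ValueError("T' and T_p are incomplete or inconsistent with T_u")
    return +originals                                     # drop zero counts


# Toy case: T_u = {a, b, c} and T_s = {a, a}, which unrealizes to
# T' = {b, c} and T_p = {b, c} with q = 2.
universe = ["a", "b", "c"]
recovered = reconstruct_samples(Counter("bc"), Counter("bc"), universe)
```

The consistency checks catch some, but not all, partial inputs: a leaked subset of T' and T_p generally does not give the right value of q, so the complement cannot be formed.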

4.4 Enhanced Protection with Dummy Values 

Dummy values can be added for any attribute such that the domain of the perturbed sample data sets will be 
expanded while the addition of dummy values will have no impact on T_s. Dummy represents a dummy attribute value that 
plays no role in the data collection process. In this way we can keep the same resulting decision tree (because the entropy of 
T_s does not change) while arbitrarily expanding the size of T_u. Meanwhile, all data sets in T' and T_p, including the ones with 
a dummy attribute value, are needed for determining the entropies of q[T' + T_p]^c during the decision tree generation 
process. 

4.5 C5.0 algorithm 

The proposed approach uses the C5.0 algorithm for data mining. C5.0 is an enhanced and optimized successor of 
C4.5, and exhibits better performance than other existing mining algorithms. C5.0 can build either a decision tree or a rule 
set. A C5.0 model works by splitting the sample based on 
the field that provides the maximum information gain. Each sub sample defined by the first split is then split again, usually 
based on a different field, and the process repeats until the sub samples cannot be split any further. Finally, the lowest-level 
splits are re-examined, and those that do not contribute significantly to the value of the model are removed or pruned. C5.0 
can produce two kinds of models. A decision tree is a straightforward description of the splits found by the algorithm. Each 
terminal (or "leaf") node describes a particular subset of the training data, and each case in the training data belongs to 
exactly one terminal node in the tree. 

In contrast, a rule set is a set of rules that tries to make predictions for individual records. Rule sets are derived 
from decision trees and, in a way, represent a simplified or distilled version of the information found in the decision tree. 
Rule sets can often retain most of the important information from a full decision tree but with a less complex model. Because 
of the way rule sets work, they do not have the same properties as decision trees. The most important difference is that with a 
rule set, more than one rule may apply for any particular record, or no rules at all may apply. If multiple rules apply, each 
rule gets a weighted "vote" based on the confidence associated with that rule, and the final prediction is decided by 
combining the weighted votes of all of the rules that apply to the record in question. If no rule applies, a default prediction is 
assigned to the record. An alternative formalism was also introduced, consisting of a list of rules of the form "if A and B and C 
and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose 
conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class. Each case belongs to one of a 
small number of mutually exclusive classes. Properties of every case that may be relevant to its class are provided, although 
some cases may have unknown or non-applicable values for some attributes. C5.0 can deal with any number of attributes. 
Rule sets are generally easier to understand than trees since each rule describes a specific context associated with a class. 
Furthermore, a rule set generated from a tree usually has fewer rules than the tree has leaves, another plus for 
comprehensibility. Another advantage of rule set classifiers is that they are often more accurate predictors than decision trees. 
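
The weighted-vote behaviour of a rule set described above can be illustrated with a short sketch. This is not C5.0's actual implementation: the rule representation (a condition dict, a class label and a confidence), the function name and the example rules are all assumptions made for illustration.

```python
from collections import defaultdict


def predict(record, rules, default):
    """Combine the confidence-weighted votes of every rule that applies."""
    votes = defaultdict(float)
    for conditions, label, confidence in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            votes[label] += confidence
    if not votes:
        return default  # no rule applies to this record
    return max(votes, key=votes.get)


rules = [
    ({"outlook": "sunny"}, "no", 0.8),
    ({"wind": "weak"}, "yes", 0.6),
    ({"humidity": "normal"}, "yes", 0.5),
]
first = predict({"outlook": "sunny", "wind": "weak", "humidity": "normal"}, rules, "no")
second = predict({"outlook": "rain", "wind": "strong", "humidity": "high"}, rules, "no")
```

For the first record all three rules apply, and the two "yes" rules together outweigh the single "no" rule; for the second record no rule applies, so the default class is returned.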

A C5.0 decision tree is constructed using GainRatio, a measure incorporating entropy. Entropy, E(S), 
measures how unordered the data set is. It is given by the following equation when there are classes C_1, ..., C_m in data set S, 
where P(S_c) is the probability of class c occurring in the data set S: 

\[ E(S) = -\sum_{c=1}^{m} P(S_c) \log_2 P(S_c) \]


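Information Gain is a measure of the improvement in the amount of order obtained by splitting S on a variable V, 
where S_v is the subset of S for which V has the value v: 

\[ Gain(S, V) = E(S) - \sum_{v \in Values(V)} \frac{|S_v|}{|S|} E(S_v) \]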

Gain has a bias towards variables with many values that partition the data set into smaller ordered sets. In order to 
reduce this bias, the entropy of each variable V over its m variable values is calculated as SplitInfo, where S_{v_i} is the 
subset of S for which V takes its i-th value: 

\[ SplitInfo(S, V) = -\sum_{i=1}^{m} \frac{|S_{v_i}|}{|S|} \log_2 \frac{|S_{v_i}|}{|S|} \]

GainRatio is calculated by dividing Gain by SplitInfo so that the bias towards variables with large value sets is dampened: 

\[ GainRatio(S, V) = \frac{Gain(S, V)}{SplitInfo(S, V)} \]

C5.0 builds a decision tree greedily by splitting the data on the variable that maximizes gain ratio. A final 
decision tree is changed to a set of rules by converting the paths into conjunctive rules and pruning them to improve 
classification accuracy. 
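
The gain ratio criterion can be illustrated with a few lines of Python. This is a sketch of the standard formulas rather than of C5.0 itself; the dict-based rows, the function names and the toy data are assumptions. A variable with a single value has zero SplitInfo, and the sketch returns 0.0 in that case by convention.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, variable, target):
    """Gain(S, V) divided by SplitInfo(S, V) for one candidate variable."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[variable], []).append(row[target])
    gain = entropy([row[target] for row in rows]) - sum(
        len(g) / total * entropy(g) for g in groups.values()
    )
    split_info = entropy([row[variable] for row in rows])
    return gain / split_info if split_info else 0.0


rows = [
    {"id": 1, "outlook": "sunny", "wind": "weak", "play": "no"},
    {"id": 2, "outlook": "sunny", "wind": "strong", "play": "no"},
    {"id": 3, "outlook": "rain", "wind": "weak", "play": "yes"},
    {"id": 4, "outlook": "rain", "wind": "strong", "play": "yes"},
]
ratios = {v: gain_ratio(rows, v, "play") for v in ("id", "outlook", "wind")}
best = max(ratios, key=ratios.get)
```

Both `id` and `outlook` have the same plain Gain here, but `id` splits the data into four singleton groups and is penalized by its larger SplitInfo, so `outlook` is selected.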

V. Theoretical Evaluation 

This section provides a concise theoretical evaluation of our approach. For full details on our evaluation process, we 
refer to [19]. 

5.1 Privacy Issues 

Private information could potentially be disclosed by the leaking of some sanitized data sets, T_L (a subset of the 
entire collected data table, T_D), to an unauthorized party if 

1. the attacker is able to reconstruct an original sample, t_s, from T_L, or 

2. t_L (a data set in T_L) matches t_s (a data set in T_s) by chance. 

In the scope of this paper, t_s is non-reconstructable because |T_L| is much smaller than |T' + T_p|. Hence, we are 
focusing on the privacy loss via matching. Without privacy preservation the collected data sets are the original samples. 
Samples with more even distribution (low variance) have less privacy loss, while data sets with high frequencies are at risk. 
The data set complementation approach solves the privacy issues of those uneven samples. This approach converts the 
original samples into some unrealized data sets [T' + T_p], such that the range of privacy loss is decreased. Data set 
complementation is in favour of those samples with high variance distribution, especially when some data sets have zero 
counts. However, it does not provide significant improvement for the even cases. 

Adding dummy attribute values improves the effectiveness of the data set complementation approach; 
however, this technique requires the storage size of cR|T_u| - |T_s|, where c is the count of the most frequent data set in T_s. The 
worst case storage requirement equals (R|T_u| - 1)*|T_s|. 

VI. Experiments 

This section presents the experimental results of applying the data set complementation approach to the following samples: 

1. normally distributed samples and evenly distributed samples 

2. extremely unevenly distributed samples 

3. six sets of randomly picked samples, where (i) was generated without creating any dummy attribute values and (ii) was 
generated by applying the dummy attribute technique to double the size of the sample domain. 

For the artificial samples (Tests 1-3), we will study the output accuracy (the similarity between the decision tree 
generated by the regular method and by the new approach), the storage complexity (the space required to store the unrealized 
samples based on the size of the original samples) and the privacy risk (the maximum, minimum, and average privacy loss if 
one unrealized data set is leaked). 

6.1 Output Accuracy 

In all cases, the decision tree(s) generated from the unrealized samples (by algorithm Generate-Tree' described in 
Section 4.2) is the same as the decision tree(s), Tree_{T_s}, generated from the original samples by the regular method. This 
result agrees with the theoretical discussion of the data set complementation approach. 

6.2 Storage Complexity 

From the experiment, the storage requirement for the data set complementation approach increases from |T_s| to 
(2|T_u| - 1)|T_s|, while the required storage may be doubled if the dummy attribute values technique is applied to double the 
sample domain. The best case happens when the samples are evenly distributed, as the storage requirement is the same as for 
the originals. The worst case happens when the samples are distributed extremely unevenly. Based on the randomly picked tests, 
the storage requirement for our approach is less than five times (without dummy values) and eight times (with dummy values, 
doubling the sample domain) that of the original samples. 






6.3 Privacy Risk 

Without the dummy attribute values technique, the average privacy loss per leaked unrealized data set is small, 
except for the even distribution case (in which the unrealized samples are the same as the originals). By doubling the sample 
domain, the average privacy loss for a single leaked data set is zero, as the unrealized samples are not linked to any 
information provider. The randomly picked tests show that the data set complementation approach eliminates the privacy 
risk for most cases and always improves privacy security significantly when dummy values are used. 

VII. Conclusion 

We introduced a new privacy preserving approach via data set complementation which preserves the utility of 
training data sets for decision tree learning. This approach converts the sample data sets, T_s, into some unreal data sets (T' + 
T_p) such that any original data set is not reconstructable if an unauthorized party were to steal some portion of (T' + 
T_p). Meanwhile, there remains only a low probability of random matching of any original data set to the stolen data sets, T_L. 
The data set complementation approach ensures that the privacy loss via matching ranges from 0 to |T_L|*(|T_s|/|T_u|), where 
T_u is the set of possible sample data sets. By creating dummy attribute values and expanding the size of the sample domain, 
the privacy loss via matching will be decreased. 

Privacy preservation via data set complementation fails if all training data sets are leaked because the data set 
reconstruction algorithm is generic. Therefore, further research is required to overcome this limitation. As it is very 
straightforward to apply a cryptographic privacy preserving approach, such as the (anti)monotone framework, along with 
data set complementation, this direction for future research could correct the above limitation. This paper covers the application of 
this new privacy preserving approach with the ID3 decision tree learning algorithm and discrete-valued attributes only. The 
proposed approach can be extended with the C5.0 algorithm and with data mining methods for mixed 
discretely and continuously valued attributes. The storage size of the unrealized samples, the processing time when 
generating a decision tree from those samples, and the privacy protection can be improved using the C5.0 algorithm for both 
continuous and discrete data sets. When compared with the existing modified ID3 algorithm, the proposed method provides better results. 